The profusion of online news articles makes it difficult to find interesting articles, a problem that can be 
assuaged by using a recommender system to bring the most relevant news stories to readers. However, news 
recommendation is challenging because the most relevant articles are often new content seen by few users. 
In addition, they are subject to trends and preference changes over time, and in many cases we do not have 
sufficient information to profile the reader. 

In this paper, we introduce a class of news recommendation systems based on context trees. They can 
provide high-quality news recommendation to anonymous visitors based on present browsing behaviour. 
We show that context-tree recommender systems provide good prediction accuracy and recommendation 
novelty, and they are sufficiently flexible to capture the unique properties of news articles. 



1. INTRODUCTION 

With a growing number of online news stories, it has become difficult to find interesting 
articles. Recommender systems alleviate this issue by bringing the most relevant items 
to users. However, while such systems have been used with considerable success for 
products such as books and movies, they have found surprisingly little application in 
recommending news articles. 

There are many challenges in finding the most relevant news stories of superior 
quality to recommend to readers. Firstly, articles must be recommended soon after 
they are written, hence leaving little time to collect data about their popularity. A 
second issue is that there is often little data available about a user's past behaviour 
because many news sites are reached through search engines, so visitors cannot be 
identified. Finally, recommendations for news stories depend on a number of factors: 
the popularity of a news item, the freshness of the story, the topic and the sequence of 
news items or topics that the user has seen so far. 

Current approaches [Das et al. 2007; IJntema et al. 2010] apply techniques designed 
for product recommendation to the domain of news articles. However, in doing so they 
ignore the intrinsic properties of news stories. To overcome the lack of data about the 
users, they frequently rely on the history of logged-in users, which causes potential 
privacy issues to the users. We believe this is why many newspaper websites continue 
to recommend news articles using a simple most-popular approach. 

Our contribution is a class of online recommendation algorithms based on Context- 
Tree (CT) models. The online nature of the algorithms means that the model pro- 
vides recommendations and is updated simultaneously and fully incrementally. Con- 
text trees are a versatile class of Bayesian statistical models. A CT model defines a 
partition tree in some space, where each subset of each partition is called a context. 
Each context is associated with a different local prediction model, called an expert, 
which are then combined to make predictions. It is important to select both the parti- 
tion structure and the expert model appropriately. 

In this work, we consider different spaces to partition: a) sequences of news, b) 
sequences of topics, and c) topic distributions. The first two constructions estimate 
distributions of variable-order Markov models, and so concentrate on modelling the 
temporal characteristics of users' behaviour. The last model instead makes predictions 
conditional on the distribution of topics preferred by a user. Each of these constructions 
results in a different behavioural model. 
By assigning an expert to each context, the CT model defines a tree distribution on 
expert models. Each expert model gives predictions for a subset of the data space, i.e. 
the context the expert is responsible for. The expert predictions are combined to make 
recommendations. We tailor our expert models to the idiosyncrasies of news. More 
specifically, our expert models take into account the popularity and freshness of news 
items. 

In all cases, the CT distribution admits a closed-form, incremental Bayesian infer- 
ence procedure. Hence, in contrast to other methods, our approach can easily be em- 
ployed online to simultaneously generate recommendations and update the model. 

The questions we want to answer are whether a) a sophisticated expert model can 
improve recommendation quality, b) the temporal sequence is important for recom- 
mendation, c) the content helps in making good recommendations, and d) CT models 
give novel recommendations. 

To answer these questions, we examine a scenario where users visit a website anonymously 
and only information about the current visit can be used to make recommendations. We 
obtained access logs from two newspaper websites: Tribune de Genève and 24 Heures¹ (the 
most popular newspapers in the Cantons of Geneva and Vaud, respectively). As 
recommendations are difficult to evaluate with real users, we measure how well our 
recommendations match the news items that readers selected themselves. 

We show that context-tree recommender systems have a robust performance, sur- 
passing that of a standard approach over a wide range of parameters. In addition, we 
performed an independent unbiased test where we show that CT methods achieve good 
performance for both accuracy of prediction and novelty of recommendations. 

The remainder of this paper begins with a brief review of the literature. In Section 3 
we introduce the general idea of context-tree recommender systems, and define 
recommender systems that build a context tree over sequences of items (Sec. 3.1.1), 
sequences of topics (Sec. 3.1.2) and a hybrid of the two (Sec. 3.1.3). We also show how to 
take advantage of the topic distributions (Sec. 3.2). Section 3.3 characterizes the different 
expert models tailored for the domain of news. We present and discuss our results in 
Section 4. Finally, we conclude our work in the last section. 

2. RELATED WORK 

In this work, we use recommender systems to suggest relevant and interesting 
news articles to readers. In general, there are two classes of recommender systems 
[Adomavicius and Tuzhilin 2005]: collaborative filtering systems [Su and Khoshgoftaar 2009], 
which recommend items based on the preferences of similar users, and content-based 
systems [Lops et al. 2011], which use the content similarity of the items. 

The earliest example where collaborative filtering is used for news recommendation 
is the GroupLens project, which applies it to newsgroups [Resnick et al. 1994; 
Konstan et al. 1997]. News aggregation systems such as Google News [Das et al. 2007] 
also implement such algorithms. Das et al. use Probabilistic Latent Semantic 
Indexing and MinHash for clustering news items, and item covisitation for recommendation 
(i.e. two news items clicked by the same user within a given time frame). Their 
system builds a graph in which the nodes are the news stories and the edges represent 
the number of covisitations. Each of the approaches generates a score for a given news 
item, and the scores are aggregated into a single score through a linear combination. 

Content-based recommendation is more common for news personalisation 
[Billsus and Pazzani 1999; Ahn et al. 2007; IJntema et al. 2010; Abel et al.]. 
NewsWeeder [Lang 1995] is probably the first content-based approach for recommendations, 
but applied to newsgroups. NewsDude [Billsus and Pazzani 1999] and more 
recently YourNews [Ahn et al. 2007] implemented a content-based system. They both 
use Term Frequency-Inverse Document Frequency (TF-IDF) and the cosine similarity 
between TF-IDF vectors to generate the recommendations. NewsDude has a model for 
long-term interests and another for short-term interests. The long-term model represents 
news stories as Boolean feature vectors, where each feature indicates the presence or 
absence of pre-selected words. A naive Bayesian classifier is used with these vectors. 
Short-term interests are captured by converting news stories into a TF-IDF vector 
and applying the k nearest neighbours algorithm with cosine similarity. 

¹ www.tdg.ch and www.24heures.ch 

It is also possible to combine the two types in a hybrid system [Burke 2002; Liu et al. 
2010; Li et al. 2010]. For example, Liu et al. [2010] extend the Google News 
study by looking at the user click behaviour in order to create accurate user profiles. 
They propose a Bayesian model to recommend news based on the user's interests and 
the news trend of a group of users. They combine this approach with the one by Das et 
al. [2007] to generate personalized recommendations. Li et al. [2010] 
introduce an algorithm based on a contextual bandit which learns to recommend by 
selecting news stories to serve users based on contextual information about the users 
and stories. At the same time, the algorithm adapts its selection strategy based on 
user-click feedback to maximize the total user clicks. 

Most of these works rely on the history of logged-in users. This causes potential 
privacy issues to the users. Our work departs from this restriction and considers only 
a one-time session for recommendation, where users do not log in. These works also discard 
the strong sequential component of reading news stories, which can be modelled as a 
Markov process. Classic recommender system approaches such as collaborative filtering 
require recomputing the model whenever new data arrive. In this work, we propose an 
incremental algorithm that updates the model continuously and with little additional 
computation [Dimitrakakis 2010], and is thus better suited to such a dynamic domain. 

We focus on a class of recommender systems based on context trees. Usually, these 
trees are used to estimate Variable-order Markov Models (VMM). VMMs were 
originally applied to lossless data compression, in which a long sequence of symbols is 
represented as a set of contexts and statistics about symbols are combined into a 
predictive model [Rissanen 1983]. VMMs have many other applications [Begleiter et al. 
2004]. 

Closely related, variable-order hidden Markov models [Wang et al. 2006], hidden 
Markov models [Montgomery et al. 2004] and Markov models [Pitkow and Pirolli 
1999; Sarukkai 2000; Deshpande and Karypis 2004] have been extensively studied for 
the related problem of click prediction. These models suffer from high state complexity. 
Although techniques [Zaki et al. 2010] exist to decrease this complexity, the main 
drawback is that multiple models have to be maintained, making these approaches 
neither scalable nor suitable for online learning. 

Few works [Zimdars et al. 2001; Shani et al. 2005; Rendle et al. 2010] apply such 
Markov models to recommender systems. Zimdars et al. [2001] describe 
a sequential model with a fixed history. Predictions are made by learning a forest of 
decision trees, one for each item. When the number of items is large, this approach does not 
scale. Our approach requires only one tree: the context tree. Shani et al. [2005] 
consider a finite mixture of Markov models with fixed weights. They need to 
maintain a reward function in order to solve a Markov decision process for generating 
recommendations. As future work, they suggest the use of a context-specific mixture of 
weights to improve prediction accuracy. In this work, we follow such an approach. 
Rendle et al. [2010] combine matrix factorization and a Markov chain model 
for basket recommendation. The idea of factoring Markov chains is interesting and 
could be complementary to our approach. However, their limitation is that they consider 
only first-order Markov chains. A higher order is not tractable because the states are 
baskets which contain many items. 

Due to the singular properties of news, it is not possible to apply these methods 
directly and achieve good recommendations; instead, we need tailor-made models. 
Surprisingly, we do not know of any existing research that applies context-tree models 
to news recommender systems. 

3. CONTEXT-TREE RECOMMENDER 

There are two key ideas behind a Context-Tree (CT) recommender system. Firstly, it 
cuts the data space into a set of refined partitions called a partition tree. Each subset in 
every partition is called a context. The contexts are arranged in a tree structure such 
that for any context in some partition, there is always exactly one context in the previ- 
ous partition that completely contains it. The resulting tree has nodes corresponding 
to each context. A context can be the set of sequences ending in a given suffix, or a set 
of probability distributions. In this work, we focus on sets of sequences of news items 
and topics, as well as sets of topic distributions. 

The second key idea is to assign a local prediction model to each context, called an 
expert. Each expert gives predictions only for a subset of the space. For instance, a 
particular expert gives predictions only for users who have read a particular sequence 
of stories, or users who have read an article that was sufficiently close to a particular 
topic distribution. 

Recommending news articles depends on multiple parameters: the popularity of the 
news item, the freshness of the story, the sequence of news items or topics that the 
user has seen so far. We define an expert model for each of these properties and show 
how to combine them. 

For the sequence of items, we introduce three variations of the CT recommender sys- 
tem: a) the standard Variable-order Markov Model (VMM) system models the context 
as an ordered sequence of news items and the experts predict the next news item, b) 
the Content-based VMM (CVMM) system considers ordered sequences of topics and 
the experts predict the next topic, and c) the Hybrid VMM (HVMM) recommender 
builds a context tree of ordered sequences of topics, but the experts predict the next 
news item. 

The CVMM and HVMM approaches look at sequences of the best-matching topic for each 
item. However, these contexts may be too restrictive, because each one represents a 
sequence of individual topics. We therefore also present a context-tree recommender 
system, called k-CT, which builds a tree on the partitions of the k-dimensional space of 
topic distributions. 

For all those models, each context is associated to an expert who predicts the next 
item. A weight is assigned to each expert expressing how confident the expert is in its 
prediction. Recommendations are made by selecting a path in the context tree (i.e. a 
set of contexts), and combining the weighted predictions of each expert. In the VMM 
approach for example, we select the path matching the current sequence of read news 
items. The predictions are propagated along the path from the most general context 
down to the most specific context, and the weights of the corresponding experts are 
updated at the same time. 

We now detail the CT model and inference procedure for the VMM system. The 
remaining systems use an equivalent procedure. 

3.1. VMM-based Recommender Systems 

Because of the sequential nature of news reading, it is intuitive to model news browsing 
as a Markov process [Shani et al. 2005]. Readers are in different states at a given 
time, and recommendations are generated by looking at the transition probability from 
one state to another. The user's state can be summarised by the last k items visited. 
We refer to such k-long sequences of items as contexts. A context corresponds to one 
news item, or k news items in the case of a k-order Markov model. Larger values of k 
lead to contexts that are more informative, but also scarcer. 

Fig. 1. Context tree for the sequence $s = (n_1, n_2, n_3, n_2)$. Nodes in dashed red are active experts $\mu \in \mathcal{A}(s)$. 

Variable-order Markov Models extend Markov models so that the context length is 
not fixed but varies [Begleiter et al. 2004]. This flexibility allows the model to use a 
larger order only in cases where doing so results in better predictions. As a result, it 
has the advantage of performing well when learning on short sequences and on 
low-quality datasets. 

In the following sections, we first consider a context-tree in the space of sequences 
of news articles. Then, using the most probable topic of each news item, we look at 
sequences of read topics. Finally, we combine the two approaches into a hybrid version. 

3.1.1. Standard Recommender. A variable-order Markov model recommender builds a 
context tree representing the sequences of news items. More formally, let $\mathcal{N}$ be the 
set of news items and $\mathcal{S}$ the set of sequences or visits made by anonymous users. We 
consider a sequence of read news items $s = (n_1, n_2, \ldots, n_t)$, $n_i \in \mathcal{N}$, $s \in \mathcal{S}$. A context $\xi$ is 
a suffix of $s$, and we write $\xi \prec s$ when the last elements of $s$ are equal to $\xi$. 

The context tree is a tree $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V}$ and edges $\mathcal{E}$ such that each node 
$v_i \in \mathcal{V}$ corresponds to a unique context $\xi_i$, and the $k$th node's parent has a context 
$\xi_{k-1} \prec \xi_k$. Specifically, the root node $v_0$ corresponds to the empty context $\xi_0 = (\emptyset)$, 
and the child node $v_k$ at depth $k$ has the context $\xi_k = n_{t-k} \circ \xi_{k-1}$, where $\cdot \circ \cdot$ is the 
concatenation operator, for example $n_1 \circ (n_2, n_3) = (n_1, n_2, n_3)$. 
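
As an illustration of how contexts nest, the following minimal Python sketch lists the suffix contexts of a visit, from the empty root context down to the most specific one. The function name `active_contexts` and the `max_depth` parameter are hypothetical and only serve the example; a real tree would contain only the contexts observed so far.

```python
def active_contexts(sequence, max_depth=None):
    """Return the suffixes of `sequence`, from the empty (root) context to the longest."""
    seq = tuple(sequence)
    depth = len(seq) if max_depth is None else min(max_depth, len(seq))
    # The context at depth k is the suffix made of the last k items read.
    return [seq[len(seq) - k:] for k in range(depth + 1)]


# Visit from Fig. 1: the reader saw n1, n2, n3 and then n2 again.
path = active_contexts(["n1", "n2", "n3", "n2"])
short_path = active_contexts(["n1", "n2", "n3", "n2"], max_depth=2)
```

Each context in `path` is contained in the one before it, which is the property that arranges the contexts in a tree.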

Each context $\xi_i$ is associated with an expert $\mu_i$, who predicts the next news item. 
For a specific sequence $s$ of news items, a subset $\mathcal{A}(s)$ of experts is active such that 
$\mathcal{A}(s) = \{\mu_i : \xi_i \prec s\}$. An expert $\mu_i$ has a probability distribution over the news items, 
and we write $P_i(n_{t+1} = x \mid s)$ for the posterior probability of the next news item being $x$ 
given the sequence $s$ for the expert $\mu_i$. 

We associate a weight $w_i$ to each expert $\mu_i$, and define a Bayesian Variable-order 
Markov Model (BVMM) [Dimitrakakis 2010]. Standard approaches use the Context 
Tree Weighting algorithm [Willems et al. 1995], which is defined for binary prediction. 
We use the generalised version, BVMM, which incrementally updates the weights with 
a Bayesian rule. The probability $P(n_{t+1} = x \mid s)$ of the next news item being $x$ is defined 
as a mixture of probabilities of all active experts. 

$$P(n_{t+1} = x \mid s) = \sum_{i : \mu_i \in \mathcal{A}(s)} w_i \, P_i(n_{t+1} = x \mid s) \tag{1}$$

We can interpret the weights as the confidence of the prediction made by an expert 
for a news item given the current sequence. For a given sequence $s$ we have a 
corresponding set of active experts $\mathcal{A}(s)$ which forms a path in the context tree, starting 
from the root expert $\mu_0$ to the leaf expert $\mu_k$. We define the root expert $\mu_0$ to have a 
weight of 1: $w_0 = 1$, then $q_0 = P_0(n_{t+1} = x \mid s)$. We update the weights of the active 
experts sequentially as follows: 

$$w'_k = \frac{w_k \, P_k(n_{t+1} = x \mid s)}{q_k} \tag{2}$$

where 

$$q_k = w_k \, P_k(n_{t+1} = x \mid s) + (1 - w_k) \, q_{k-1} \tag{3}$$

is the combined prediction of the first $k$ experts. Note that the path in the tree and 
the corresponding set of active experts change for each different sequence $s$. Therefore, 
the expert $\mu_k$ is not the same across sequences. The parameters of non-active experts 
remain the same. 
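
The recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `Node` and `BVMM`, the initial weight `prior_weight = 0.5` for new experts, the depth limit `max_depth`, and the smoothed-count expert are all choices made for the example. The weight update follows the rule used in Algorithm 1, $w_k \leftarrow w_k p_k / q$.

```python
class Node:
    def __init__(self, weight):
        self.weight = weight   # w_k: confidence in this context's expert
        self.counts = {}       # clicks observed while this expert was active


class BVMM:
    def __init__(self, items, max_depth=3, alpha0=1.0, prior_weight=0.5):
        self.items = list(items)
        self.max_depth = max_depth
        self.alpha0 = alpha0               # assumed smoothing count
        self.prior_weight = prior_weight   # assumed weight of a new expert
        self.nodes = {(): Node(1.0)}       # root expert has w_0 = 1

    def _path(self, seq):
        """Active contexts for `seq`, from the root to the most specific one."""
        seq = tuple(seq)
        depth = min(self.max_depth, len(seq))
        contexts = [seq[len(seq) - k:] for k in range(depth + 1)]
        return [c for c in contexts if c in self.nodes]

    def _expert(self, node, x):
        """Smoothed relative frequency of x under this expert."""
        total = sum(node.counts.values()) + self.alpha0 * len(self.items)
        return (node.counts.get(x, 0) + self.alpha0) / total

    def predict(self, seq, x):
        q = 0.0
        for ctx in self._path(seq):
            node = self.nodes[ctx]
            q = node.weight * self._expert(node, x) + (1 - node.weight) * q
        return q

    def update(self, seq, x):
        q = 0.0
        for ctx in self._path(seq):
            node = self.nodes[ctx]
            p = self._expert(node, x)
            q = node.weight * p + (1 - node.weight) * q   # combined prediction
            node.weight = node.weight * p / q             # Bayesian weight update
            node.counts[x] = node.counts.get(x, 0) + 1
        # Grow the tree by at most one new context per observation.
        seq = tuple(seq)
        for k in range(1, min(self.max_depth, len(seq)) + 1):
            ctx = seq[len(seq) - k:]
            if ctx not in self.nodes:
                self.nodes[ctx] = Node(self.prior_weight)
                break


model = BVMM(items=["a", "b", "c"])
for _ in range(20):
    model.update(["a"], "b")   # readers of "a" keep moving on to "b"
```

Because every expert is a proper distribution and each step is a convex combination, `predict` sums to one over the items; after the loop above the model ranks "b" ahead of "c" for a reader who has just seen "a".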

For instance, Figure 1 shows the context tree for the sequence $s = (n_1, n_2, n_3, n_2)$ and 
the active experts $\mathcal{A}(s)$. $q_0$ is the prediction of the root expert for the next item $x$, and 
$q_k$ at the leaf expert is the complete prediction by the model. The parameters of each expert 
are updated subsequently in the manner described in Section 3.3. 

The VMM recommender builds a context tree based on the sequence of news items. A 
news story is about some topics, hence it is possible to model the behaviour of a reader 
as a sequence of topics instead of news items. The following two sections illustrate this 
idea and introduce two variations of the VMM recommender system: the content-based 
VMM and hybrid VMM recommender systems. 

3.1.2. Content-based Recommender. Since the number of news items is very large, perhaps 
a better approach would be to recommend stories whose content is similar to 
those that a user has previously read. This can be done by representing each news story 
as a vector of features. The system evaluates the similarity between vectors, and 
recommends the items with the most similar feature vectors. With this, even an item 
that has not been read by anyone can be recommended. 

Most approaches to feature representation use TF-IDF, in which a set of keywords 
or terms is chosen and, for each news item, the frequency of occurrence of each keyword 
in that item is weighted by how frequently the keyword occurs among all news 
stories. This reduces a news story to a vector containing a TF-IDF score for each keyword. 
The major drawback is that it does not capture the inter- or intra-structure of news 
stories very well. 

Instead, we use a probabilistic topic model technique to learn the content. In particular, 
we choose Latent Dirichlet Allocation (LDA) over other methods such as 
Probabilistic Latent Semantic Indexing [Hofmann 1999] because the latter suffers from 
overfitting in practice [Blei et al. 2003]. 

The idea of LDA is that a journalist writes an article with particular topics in mind, 
and she draws words with a certain probability from a bag of words of each topic. A 
news story is then represented as a mixture of various topics. The goal is to find a 
mixture of topics for each news item. We write $P(z \mid n)$ for the probability distribution 
over topics $z$ in a particular news item $n$, and $P(\lambda \mid z)$ for the probability distribution 
over words $\lambda$ for a given topic $z$. $P(z_i = j \mid n)$ denotes the probability that the $j$th topic 
is assigned to the $i$th word in news item $n$, and $P(\lambda_i \mid z_i = j)$ the probability of word $\lambda_i$ within 
topic $j$. It follows that 

$$P(\lambda_i \mid n) = \sum_{j=1}^{Z} P(\lambda_i \mid z_i = j) \, P(z_i = j \mid n) \tag{4}$$

Note that we need to specify the number of latent topics $Z$ in advance. Let $\phi^{(j)} = 
P(\lambda \mid z = j)$ be the multinomial distribution over words for topic $j$ and $\theta^{(n)} = P(z \mid n)$ the 
multinomial distribution over topics for news item $n$. The parameters $\phi$ and $\theta$ indicate 
which words are important for which topic and which topics are significant for a certain 
item. $\theta$ and $\phi$ have Dirichlet priors with hyperparameters $\alpha$ and $\beta$ respectively. To 
estimate $\phi$ and $\theta$, we use the Gibbs sampling technique [Griffiths and Steyvers 2004] 
in which a set of samples from $P(z \mid \lambda)$ is sufficient for the estimation. 
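
To make the mixture concrete, here is a small Python sketch that evaluates the probability of a word in a news item as a sum over topics. The two topics, the three-word vocabulary and all numeric values are invented for illustration; in practice the per-topic and per-item distributions would come from the fitted LDA model.

```python
def word_probability(word, theta_n, phi):
    """P(word | n) = sum over topics j of P(word | z = j) * P(z = j | n)."""
    return sum(phi[j].get(word, 0.0) * theta_n[j] for j in range(len(theta_n)))


# Hypothetical toy model: two topics over a three-word vocabulary.
phi = [
    {"election": 0.7, "vote": 0.2, "goal": 0.1},   # topic 0: politics
    {"election": 0.1, "vote": 0.1, "goal": 0.8},   # topic 1: sport
]
theta_n = [0.75, 0.25]   # topic mixture of one news item

p_goal = word_probability("goal", theta_n, phi)
p_all = sum(word_probability(w, theta_n, phi) for w in ("election", "vote", "goal"))
```

Here `p_goal` is 0.1 × 0.75 + 0.8 × 0.25 = 0.275, and the probabilities over the whole vocabulary sum to one, as expected of a mixture of distributions.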

To apply LDA, we concatenate the title, summary and content of the news item 
together, then we tokenize the words and remove stopwords. After that, we apply LDA 
to all the news stories in the dataset, and obtain a topic distribution vector $\theta^{(n)}$ for 
each news item. Note that the topics may not be directly interpretable because they are 
neither classified nor named. 

We can now redefine the context tree of Section 3.1.1. Using the most probable topic 
of each news item, we consider sequences of read topics $s = (z_1, z_2, \ldots, z_t)$. The context 
$\xi$ is a suffix of a sequence $s$ of read topics. The remainder of the model stays the 
same as in the VMM recommender system. The only difference is that we have topics 
instead of news items. Hence we cannot recommend news items directly, but we 
generate recommendations by combining the probability of the next topic with the 
probability of that topic for each news item. The score $\mathrm{score}(n \mid s)$ of a news item $n$ is 
given by 

$$\mathrm{score}(n \mid s) = \max_j \{ P(z_{t+1} = j \mid s) \, P(z = j \mid n) \} \tag{5}$$

The system evaluates each candidate news story and recommends the news items with 
the highest scores. We name this system the Content-based VMM recommender system 
(CVMM RecSys). 
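
The scoring rule is easy to state in code. The sketch below assumes that the next-topic probabilities have already been produced by the context tree and that each candidate comes with its LDA topic distribution; the function names and the toy numbers are hypothetical.

```python
def cvmm_score(next_topic_probs, item_topic_probs):
    """score(n | s) = max over topics j of P(z_{t+1} = j | s) * P(z = j | n)."""
    return max(p_next * p_item
               for p_next, p_item in zip(next_topic_probs, item_topic_probs))


def recommend(next_topic_probs, candidates, k=2):
    """Rank candidate items by score and keep the k best."""
    ranked = sorted(candidates,
                    key=lambda n: cvmm_score(next_topic_probs, candidates[n]),
                    reverse=True)
    return ranked[:k]


# Predicted distribution over the next topic for the current visit (toy values).
next_topic_probs = [0.6, 0.3, 0.1]
# Topic distribution of each candidate news item (toy values).
candidates = {
    "story_a": [0.1, 0.8, 0.1],
    "story_b": [0.7, 0.2, 0.1],
    "story_c": [0.2, 0.2, 0.6],
}
top = recommend(next_topic_probs, candidates)
```

With these values the scores are 0.24, 0.42 and 0.12, so `story_b` is ranked first and `story_a` second. Taking the maximum rather than the sum means a candidate is judged by its single best-matching topic.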

3.1.3. Hybrid Recommender. We combine the standard VMM with the content-based 
VMM recommender system into a hybrid version. The context tree is built on topics, 
similarly to the CVMM system, but the experts make predictions about news items, 
like the VMM system. 

The Hybrid VMM recommender system (HVMM RecSys) builds a tree in the space 
of topic sequences. The context $\xi$ is a suffix of a sequence of most probable topics $s = 
(z_1, z_2, \ldots, z_t)$. The sets of fresh $\mathcal{F}$ and popular $\mathcal{P}$ items contain news stories and 
not topics. All probabilities (Equation 1 and the related equations) are defined with 
respect to news items. 

The tree structure is very limited because the context $\xi$ is constrained to a sequence 
of individual topics. In the next section, we lift this restriction to fully exploit the topic 
distribution. 

3.2. k-d Context-Tree Recommender 

For a given news story, we obtain a topic distribution. The CVMM and HVMM structures 
seen before use only the most probable topic to construct the sequence. However, 
perhaps the complete topic distribution of the last news item is more important than 
the temporal sequence of most probable topics. For this reason, we use a k-d tree to 
build a context model in the space of topic distributions. 

A k-d tree is a binary tree that partitions a k-dimensional space into smaller spaces 
[Bentley 1975]. A node corresponds to a hyperplane splitting the space of topic 
distributions into two hypercubes. We associate each node to one of the k dimensions, and 
its corresponding hyperplane is perpendicular to that dimension's axis. A leaf stores 
at least one topic distribution. For instance, a node associated to dimension $d$ splits 
the space into two subtrees at $\theta_d$: a topic distribution $\theta' = (\theta'_1, \theta'_2, \ldots, \theta'_k)$ with a smaller 
value for the dimension $d$, i.e. $\theta'_d < \theta_d$, lies in the left subtree, and one with a larger value 
$\theta'_d > \theta_d$ lies in the right subtree. 

There are various ways to construct a k-d tree, depending on the chosen partitioning 
strategy. A simple idea is to select the axis based on the depth such that we cycle 
through all possible axes: axis = depth mod k. 

In order to use this structure as a context tree, we assign a context $\xi$ to each node 
as before; however, the context now represents a subset of the possible topic distributions. 
Every time the system observes a new topic distribution, the distribution is added to 
the k-d tree, and the tree possibly expands. We refer to this method as the k-d Context-Tree 
recommender system (k-CT RecSys). 
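
A minimal Python sketch of such a tree follows, using the axis = depth mod k rule. It is a simplification made for illustration: every node stores the point that created it (the classic variant of the k-d tree) rather than keeping points only in leaves, ties go to the right subtree, and the class names `KDNode` and `KDContextTree` are hypothetical. The path returned by `insert` is the list of nodes, and hence contexts, that a topic distribution falls into.

```python
class KDNode:
    def __init__(self, point, depth):
        self.point = tuple(point)   # topic distribution that created this node
        self.depth = depth
        self.left = None
        self.right = None


class KDContextTree:
    def __init__(self, k):
        self.k = k          # number of topics, i.e. dimensions
        self.root = None

    def insert(self, point):
        """Add a topic distribution; return the nodes visited, root first."""
        if self.root is None:
            self.root = KDNode(point, 0)
            return [self.root]
        path, node = [], self.root
        while True:
            path.append(node)
            axis = node.depth % self.k   # cycle through the axes
            side = "left" if point[axis] < node.point[axis] else "right"
            child = getattr(node, side)
            if child is None:
                child = KDNode(point, node.depth + 1)
                setattr(node, side, child)
                path.append(child)
                return path
            node = child


tree = KDContextTree(k=2)
tree.insert((0.5, 0.5))
tree.insert((0.2, 0.8))
path = tree.insert((0.3, 0.7))
```

In a recommender built on this structure, each node on `path` would carry its own expert, and their predictions would be combined from the root down in the same way as for sequence contexts.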

3.3. Expert Model 

Recommending news articles depends on multiple factors: the popularity of the news 
item, the freshness of the story, the sequence of news items or topics that the user 
has seen so far. We define a model for each of these properties, and show how to com- 
bine them. The first model ignores the temporal dynamics of the process. The second 
model accounts for the possibility that the users may be mainly looking at popular 
items, while the last model assumes that users are mostly interested in fresh items 
(i.e. breaking news). Each context corresponds to an expert $\mu_i$, which calculates the 
posterior probability $P_i(x \mid s)$ of the next item $x$ given any sequence $s$ in its corresponding 
context. 

All three models are constructed through Dirichlet priors. A prior mass is assigned 
to all possible outcomes, such that the prior probability that the outcome $x$ is in some 
set $A$ is proportional to the mass of $A$. More formally, if $m(A)$ is the mass of $A$, then 
$P(x \in A \mid x \in B, A \subset B) = m(A)/m(B)$. The prior is updated via counting: whenever 
a new item $x$ is read, the mass of all sets containing $x$ increases by 1, so that $m'(A) = 
m(A) + \mathbb{I}\{x \in A\}$, where $\mathbb{I}\{x \in A\}$ equals 1 if $x \in A$, and 0 otherwise. 

3.3.1. Standard. A naive approach for estimating the multinomial probability distribution 
over the news items is to use a Dirichlet distribution on multinomial parameters 
for each expert $\mu_i$. The probability of reading a particular news item $x$ depends only on 
the number of times $\alpha_x$ it has been read when the expert is active. 

$$P_i^{std}(n_{t+1} = x \mid s) = \frac{\alpha_x + \alpha_0}{\sum_{j \in \mathcal{N}} (\alpha_j + \alpha_0)} \tag{6}$$

where $\alpha_0$ is the initial count of the Dirichlet distribution. 
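
A count-based expert of this kind can be sketched as follows. The sketch assumes the usual posterior mean under a symmetric Dirichlet prior, i.e. each item's count plus the initial count, normalised over all items; the class name `StandardExpert` and its method names are illustrative.

```python
from collections import Counter


class StandardExpert:
    def __init__(self, items, alpha0=1.0):
        self.items = list(items)
        self.alpha0 = alpha0       # initial (prior) count given to every item
        self.counts = Counter()    # reads observed while the expert was active

    def prob(self, x):
        """Smoothed relative frequency of item x."""
        total = sum(self.counts.values()) + self.alpha0 * len(self.items)
        return (self.counts[x] + self.alpha0) / total

    def observe(self, x):
        self.counts[x] += 1


expert = StandardExpert(items=["a", "b", "c"])
for item in ["a", "a", "b"]:
    expert.observe(item)
```

After these three reads, `expert.prob("a")` is (2 + 1) / (3 + 3) = 0.5, while the unread item "c" keeps a non-zero probability of 1/6 thanks to the initial count.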

The dynamics of news items are more complex. A news item provides new content and 
has therefore been seen by few users. News is subject to trends and frequent variations 
of preferences. We improve this simple model by augmenting it with models for popular 
or fresh news items. 

3.3.2. Popularity. A news item $x \in \mathcal{P}$ is in the set of popular items $\mathcal{P}$ when it has been 
read at least once among the last $r = |\mathcal{P}|$ read news items. We compute the probability 
of a news item $x$ given that $x$ is popular as: 

$$P_i^{pop}(n_{t+1} = x \mid s) = \frac{c_x}{\sum_{j \in \mathcal{P}} c_j} \tag{7}$$

where $c_x$ is the total number of clicks received for news item $x$. Note that $c_x$ is not equal 
to $\alpha_x$ (Eq. 6): $\alpha_x$ is the number of clicks for news item $x$ when the expert is active, while 
$c_x$ is the number of clicks received by news item $x$ in total, whether the expert is active 
or not. 

The number of popular items $|\mathcal{P}|$ is important because it is specific to each news 
website. When $|\mathcal{P}|$ is small, the expert considers only the most recently read news. It is 
possible to tune this parameter to achieve better performance. 
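
One way to maintain the popular set online is a fixed-size window over the most recent reads. The Python sketch below is an illustration under assumptions: the class name `PopularityExpert` is hypothetical, the window is taken over reads by all users, and the probability of a popular item is assumed to be its total click count normalised over the popular set.

```python
from collections import Counter, deque


class PopularityExpert:
    def __init__(self, window):
        self.recent = deque(maxlen=window)   # last `window` reads, oldest dropped first
        self.clicks = Counter()              # total clicks per item

    def observe(self, x):
        self.recent.append(x)
        self.clicks[x] += 1

    def popular_set(self):
        """Items read at least once among the most recent reads."""
        return set(self.recent)

    def prob(self, x):
        popular = self.popular_set()
        if x not in popular:
            return 0.0
        return self.clicks[x] / sum(self.clicks[j] for j in popular)


pop = PopularityExpert(window=3)
for item in ["a", "a", "b", "c", "b"]:
    pop.observe(item)
```

After these reads the window holds b, c, b, so "a" is no longer popular despite its two clicks, and "b" receives probability 2/3. A smaller window makes the expert react faster to trends at the cost of a noisier popular set.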

3.3.3. Freshness. A news item $x \in \mathcal{F}$ is in the set of fresh items $\mathcal{F}$ when it has not 
been read by anyone but is among the next $|\mathcal{F}|$ news items to be published on the 
website, i.e. breaking news. We compute the probability of news item $x$ given that $x$ 
is fresh as: 

$$P_i^{fresh}(n_{t+1} = x \mid s) = \begin{cases} 1/|\mathcal{F}| & \text{if } x \in \mathcal{F} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

The number of fresh items $|\mathcal{F}|$ influences the prediction made by this expert, and it 
is also specific to each news website. 

3.3.4. Mixing the expert models. We combine the three expert models using the following 
mixture: 

$$\begin{aligned} P_i(n_{t+1} = x \mid s) = {} & P_i^{std}(n_{t+1} = x \mid s) \, p_i^{std} \\ & + P_i^{pop}(n_{t+1} = x \mid s) \, p_i^{pop} \\ & + P_i^{fresh}(n_{t+1} = x \mid s) \, p_i^{fresh} \end{aligned} \tag{9}$$

There are two ways to compute the probabilities $p_i^{std}$, $p_i^{pop}$ and $p_i^{fresh}$: either by using 
a Dirichlet prior that ignores the expert prediction or by a Bayesian update to calculate 
the posterior probability of each expert according to their accuracy. 

For the first approach, the probability of the next news item being popular is: 

$$p_i^{pop} = P_i(n_{t+1} \in \mathcal{P}) = \frac{\alpha_{pop} + \alpha_0}{(\alpha_{pop} + \alpha_0) + (\alpha_{notpop} + \alpha_0)} = \frac{\alpha_{pop} + \alpha_0}{2\alpha_0 + \sum_j \alpha_j} \tag{10}$$

where $\sum_j \alpha_j$ represents the number of times the expert $\mu_i$ has been active, and $\alpha_{pop}$ and 
$\alpha_{notpop}$ are the numbers of read news items which were respectively popular and not popular 
when the expert $\mu_i$ was active. 

Similarly, the probability of the next news item being fresh is given by: 

$$p_i^{fresh} = P_i(n_{t+1} \in \mathcal{F}) = \frac{\alpha_{fresh} + \alpha_0}{2\alpha_0 + \sum_j \alpha_j} \tag{11}$$

where $\alpha_{fresh}$ is the number of read news items which were fresh when the expert $\mu_i$ 
was active. 

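Finally, the probability of the next news item being neither popular nor fresh is: 

$$p_i^{std} = P_i(n_{t+1} \notin \mathcal{P} \cup \mathcal{F}) = 1 - P_i(n_{t+1} \in \mathcal{P}) - P_i(n_{t+1} \in \mathcal{F}) \tag{12}$$

Note that $\mathcal{P} \cap \mathcal{F} = \emptyset$. 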
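
The prior-based mixture can be illustrated with a short Python sketch. It assumes the count-based form of the probabilities, with `n_active` standing for the number of times the expert has been active; the function names and the toy counts are hypothetical.

```python
def mixture_weights(alpha_pop, alpha_fresh, n_active, alpha0=1.0):
    """Probabilities that the next read is standard, popular or fresh."""
    denom = 2 * alpha0 + n_active
    p_pop = (alpha_pop + alpha0) / denom
    p_fresh = (alpha_fresh + alpha0) / denom
    # Popular and fresh sets are disjoint, so the remainder is the standard case.
    p_std = 1.0 - p_pop - p_fresh
    return p_std, p_pop, p_fresh


def mixed_prob(p_std_x, p_pop_x, p_fresh_x, weights):
    """Combine the three expert models' probabilities for one candidate item."""
    w_std, w_pop, w_fresh = weights
    return p_std_x * w_std + p_pop_x * w_pop + p_fresh_x * w_fresh


# Toy counts: the expert was active 18 times; 6 reads were popular, 2 were fresh.
weights = mixture_weights(alpha_pop=6, alpha_fresh=2, n_active=18)
p_x = mixed_prob(p_std_x=0.1, p_pop_x=0.4, p_fresh_x=0.0, weights=weights)
```

With these counts the weights are 0.5, 0.35 and 0.15, and the candidate's mixed probability is 0.1 × 0.5 + 0.4 × 0.35 = 0.19.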

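It might happen that by using the Dirichlet priors, predictions are mainly made by 
only one expert model. To overcome this issue, we compute the probabilities $p_i^{std}$, $p_i^{pop}$ 
and $p_i^{fresh}$ via a Bayesian update, which adapts them based on the performance of each 
expert model: 

$$\begin{aligned} p_i^{std} &\leftarrow \frac{P_i^{std}(n_{t+1} = x \mid s) \, p_i^{std}}{P_i(n_{t+1} = x \mid s)} \\ p_i^{pop} &\leftarrow \frac{P_i^{pop}(n_{t+1} = x \mid s) \, p_i^{pop}}{P_i(n_{t+1} = x \mid s)} \\ p_i^{fresh} &\leftarrow \frac{P_i^{fresh}(n_{t+1} = x \mid s) \, p_i^{fresh}}{P_i(n_{t+1} = x \mid s)} \end{aligned} \tag{13}$$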
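
The Bayesian update of the mixture probabilities amounts to one application of Bayes' rule per observed read, which the following Python sketch illustrates. The function name `bayes_update` and the numbers are made up for the example.

```python
def bayes_update(weights, likelihoods):
    """Posterior over the expert models after observing the item that was read.

    weights: current (p_std, p_pop, p_fresh).
    likelihoods: probability each model assigned to the observed item.
    """
    # The mixed prediction for the observed item is the normalising constant.
    evidence = sum(w * l for w, l in zip(weights, likelihoods))
    return tuple(w * l / evidence for w, l in zip(weights, likelihoods))


weights = (0.5, 0.3, 0.2)
likelihoods = (0.1, 0.4, 0.0)   # the observed item was popular but not fresh
posterior = bayes_update(weights, likelihoods)
```

The popularity model explained the observation best, so its share grows from 0.3 to about 0.71. Note that in this sketch a model assigning zero probability to the observed item ends up with weight zero and cannot recover, so a practical implementation might floor the likelihoods; that safeguard is a remark about the sketch, not something stated in the text.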



Algorithm 1 VMM recommender system 

procedure LEARN($x$, $s$, context set $\Xi$) 
    $t \leftarrow |\mathcal{A}(s)|$    // number of active experts 
    // loop from the most general expert $\mu_0$ to the most specific expert $\mu_t$ 
    for $i \leftarrow 0, t$ do 
        $p_i \leftarrow P_i(n_{t+1} = x \mid s)$ 
        $q \leftarrow w_i p_i + (1 - w_i) q$ 
        $w_i \leftarrow w_i p_i / q$ 
        if BayesianUpdate then 
            update $p_i^{std}$, $p_i^{pop}$, $p_i^{fresh}$ (Eq. 13) 
        update $\alpha_x$, $\alpha_{pop}$, $c_x$, $\alpha_{fresh}$ according to $x$ 
    // if the context is not in the tree, then add a new leaf node 
    if $x \circ s \notin \Xi$ then 
        $\Xi \leftarrow \Xi \cup \{x \circ s\}$ 
end procedure 

procedure RECOMMEND($s$) 
    for all candidates $n \in \mathcal{C}$ do 
        $t \leftarrow |\mathcal{A}(s)|$ 
        // loop from the most general expert $\mu_0$ to the most specific expert $\mu_t$ 
        for $i \leftarrow 0, t$ do 
            $q^{(n)} \leftarrow w_i P_i(n \mid s) + (1 - w_i) q^{(n)}$ 
    $\mathcal{R} \leftarrow$ sort all $n \in \mathcal{C}$ by $q^{(n)}$ in descending order 
    return first $k$ elements of $\mathcal{R}$ 
end procedure 



Algorithm 1 presents a sketch of the context-tree recommender algorithm. For simplicity,
we split the recommender system into two procedures: learn and recommend.
However, our implementation combines the two into a complete online algorithm
which makes recommendations while learning, and thus does not need offline computation.
The system estimates the probability of each candidate and recommends the
news items with the highest probability. When the recommender system needs to estimate
the probability of a candidate item, the system 1) selects the active experts $\mathcal{A}(s)$,
which correspond to a path in the context tree from the most general context to the
most specific context, and 2) propagates $q$ from the root down to the leaf, i.e. the most
specific context. The $q$ at the leaf expert is the recommender system's estimated
probability for the candidate item $x$, i.e. $P(n_{t+1} = x \mid s)$ (see Eq. 1).
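
The propagation of $q$ along the path of active experts, and the corresponding weight update, can be sketched as follows. This is a simplified illustration rather than our implementation: the per-expert predictions are passed in as plain lists ordered from the most general to the most specific expert, the helper names are ours, and we assume that the root expert has weight 1, so the initial value of $q$ does not matter.

```python
def propagate(weights, expert_probs):
    """Mix the expert predictions from the root down to the leaf."""
    q = 0.0
    for w, p in zip(weights, expert_probs):
        q = w * p + (1.0 - w) * q
    return q


def learn_weights(weights, expert_probs):
    """Return the updated weights and the final q after observing one item."""
    q = 0.0
    updated = []
    for w, p in zip(weights, expert_probs):
        q = w * p + (1.0 - w) * q
        # Posterior weight of expert i given the observed item.
        updated.append(w * p / q if q > 0.0 else w)
    return updated, q


def recommend(candidate_probs, weights, k=5):
    """Rank candidates by q; candidate_probs maps item -> per-expert probabilities."""
    scores = {n: propagate(weights, probs) for n, probs in candidate_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up numbers for a path of three experts (depths 0, 1 and 2).
new_weights, q = learn_weights([1.0, 0.5, 0.25], [0.1, 0.4, 0.8])
top = recommend({"a": [0.1, 0.6], "b": [0.3, 0.2], "c": [0.05, 0.05]}, [1.0, 0.5], k=2)
```

An expert whose prediction for the observed item is better than the mixture of the more general experts sees its weight increase, which is how deeper contexts gain influence when the sequence is informative.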
Table I. Datasets after filtering.

        News stories    Visits     Clicks
TDG     10'400          600'256    1'069'131
24H     8'613           249'099    509'978

4. EVALUATION AND COMPARISON 

In this evaluation, we are interested in whether the class of CT recommender systems
has an advantage over standard methods and, if so, which combination of
partition and expert model works best under an arbitrary usage scenario.
More specifically, we answer the following questions: 

a) What role do the different experts play? What is the best way to update the weights 
of their mixture? 

b) Is the temporal sequence important when recommending news articles? 

c) Does the content of the news stories help to make good recommendations?

d) Do CT recommender systems make novel recommendations? 

Novelty is essential because it exposes the reader to relevant news items that she
would not have seen by herself. Obvious but accurate recommendations of the most
popular items are of little use.

We evaluate our systems on two datasets. We use the first dataset to examine the
sensitivity of the CT models to hyperparameters and compare them to existing
techniques. For the second part of the experiments, we perform an unbiased comparison
between the different CT models: we first select a particular evaluation
criterion, then select the optimal hyperparameters for that criterion on the first
dataset, and finally measure the performance on the second dataset. This methodology
[Bengio et al. 2005] mirrors the approach that would be followed by a practitioner who
wants to implement a recommender system on a newspaper website.

4.1. Datasets 

We collected data from the websites of two daily Swiss-French newspapers, Tribune
de Genève (TDG) and 24 Heures (24H). TDG and 24H are the most popular
newspapers² in the cantons of Geneva and Vaud, respectively. Their websites contain news
stories ranging from local news, national and international events and sports to culture
and entertainment.

The datasets span from Nov. 2008 until May 2009. They contain all the news stories
displayed, and all the visits by anonymous users within the time period. Note that a
new visit is created every time a user browses the website, even if she browsed the
website before. The raw data has a lot of noise due to, for instance, crawling bots from
search engines or browsing on mobile devices with unreliable internet connections.
Table I shows the dataset statistics after filtering out this noise, and Figure 2 illustrates
the distribution of visit length for each dataset.

4.2. Evaluation Metrics 

There has been a lot of discussion on the best way of evaluating recommender systems
[Herlocker et al. 2004; Shani and Gunawardana 2011]. The best would be to implement
them on an actual site and measure the click rate on recommended items.
Unfortunately, this is usually far too costly to do, and evaluation has to be carried out
based on behaviour that was observed without the recommender being present. In our
case, we have visit histories from the newspaper websites, and we can evaluate how



² In 2011, TDG and 24H had a readership of 138'000 and 223'000, and a circulation of 51'487 and 75'796,
respectively.
Fig. 2. Distribution of the length of visits.



well our recommendations match the news items that readers selected themselves. It 
is clear that this is a somewhat inaccurate measure: a) the user may not have liked all 
the items she visited; b) the user may have preferred one of the recommended items to 
the one she clicked, so the fact that a recommended item was not visited does not mean 
the recommendation is bad. However, we believe that prediction of the visit history is 
still a useful way to compare the performance of different techniques, and so we use it 
here. 

We evaluate how good the systems are at predicting the future news items a user is going
to read. Specifically, we consider sequences $s_i \in \mathcal{S}$ of news items $s_i = (n_1, n_2, n_3, \ldots, n_l)$,
$n_i \in \mathcal{N}$, read by anonymous users. The sequences and the news items in each
sequence are sorted by increasing order of visit time. When an anonymous user starts
to read a news item $n_1$, the system generates recommendations. As soon as the user
reads another news item $n_2$, the system updates its model with the past observations
$n_1$ and $n_2$, and generates a new set of recommendations. Hence the training set and
the testing set are split based on the current time: at time $t$, the training set contains
all news items accessed before $t$, and the testing set has items accessed after $t$.
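
This time-based protocol amounts to replaying each visit in order: recommend before every click, score the recommendation against the click, then let the model learn from it. The sketch below illustrates the loop with a toy most-popular model standing in for the recommender; the class, its methods and the example visits are hypothetical and only serve to make the loop runnable.

```python
from collections import Counter


class MostPopular:
    """Toy stand-in recommender (hypothetical, not one of the CT models)."""

    def __init__(self):
        self.counts = Counter()

    def recommend(self, context, k):
        # Rank items by how often they were read next; skip items already seen.
        seen = set(context)
        ranked = [n for n, _ in self.counts.most_common() if n not in seen]
        return ranked[:k]

    def learn(self, context, next_item):
        self.counts[next_item] += 1


def replay(sequences, model, k=5):
    """Replay visits in time order and record a hit per recommendation set."""
    hits = []
    for sequence in sequences:
        for t in range(len(sequence) - 1):
            context = sequence[: t + 1]
            recommended = model.recommend(context, k)
            hits.append(1 if sequence[t + 1] in recommended else 0)
            # The model only sees the click after it has been scored.
            model.learn(context, sequence[t + 1])
    return hits


# Made-up visits, sorted by visit time.
visits = [["a", "b"], ["c", "b"], ["a", "b", "c"]]
hits = replay(visits, MostPopular())
```

Because the model is updated only after each click has been scored, no recommendation is ever evaluated on data it was trained on.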

For a given sequence $s = (n_1, n_2, \ldots, n_t, \ldots, n_l)$ and the current news item $n_t$ in this
sequence, we define $S$ as the set of successor news items such that $S = \{n_i : i > t\}$,
and $\mathcal{R}$ as the set of recommended news items. We say that a recommended news item
is relevant if it is in the successor set. We always recommend 5 news stories, and we
use two metrics to evaluate how good the recommendations are: Success@5 (S@5) and
Mean Average Precision (MAP).

Success@5 is equal to 1 if the immediate successor of the current item is
among the first 5 recommended news stories, and 0 otherwise.

It is interesting to consider the order in which the recommended news are presented 
by the system. Since the recommendation set is actually an ordered list of recom- 
mended news, we can compute the precision P@k at every position k in the ranked 
sequence of news stories. 



$$P@k = \frac{|\{\text{relevant and recommended news of rank } k \text{ or less}\}|}{k} \qquad (14)$$
Fig. 3. VMM recommender system: different mixtures of experts (Bayesian update, $|\mathcal{F}| = 10$).

The average precision (AP) over the ranks of the relevant news stories is calculated as:

$$AP = \frac{1}{|S \cap \mathcal{R}|} \sum_{k=1}^{n} P@k \cdot rel(k) \qquad (15)$$

where $n$ is the number of recommendations, and $rel(k)$ equals 1 if the news item at rank $k$ is
relevant, and 0 otherwise.

Finally, the Mean Average Precision is the mean of the average precision for a given
set of queries $Q$, i.e. for each recommendation set the system generates:

$$MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \qquad (16)$$



Success@5 captures how good the recommended news stories are with respect to the immediate
successor, whereas MAP looks at all future news stories and at how the recommended
news items are ordered.
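
For concreteness, the metrics can be computed as in the following sketch. It assumes that a recommendation list contains no duplicates and that average precision is normalised by the number of relevant recommended items, which is our reading of Eq. (15); the function names are illustrative.

```python
def success_at_k(recommended, immediate_successor, k=5):
    """1 if the next item read is among the first k recommendations, else 0."""
    return 1 if immediate_successor in recommended[:k] else 0


def precision_at_k(recommended, successors, k):
    """Fraction of the first k recommendations that are in the successor set."""
    hits = sum(1 for item in recommended[:k] if item in successors)
    return hits / k


def average_precision(recommended, successors):
    """Mean of P@k over the ranks k that hold a relevant item."""
    relevant_ranks = [k for k, item in enumerate(recommended, start=1)
                      if item in successors]
    if not relevant_ranks:
        return 0.0
    total = sum(precision_at_k(recommended, successors, k) for k in relevant_ranks)
    return total / len(relevant_ranks)


def mean_average_precision(queries):
    """queries: list of (recommended list, successor set) pairs."""
    return sum(average_precision(r, s) for r, s in queries) / len(queries)


# Made-up example: items "b" and "d" are read later in the visit.
recommended = ["a", "b", "c", "d", "e"]
ap = average_precision(recommended, {"b", "d", "z"})
map_score = mean_average_precision([(recommended, {"b", "d", "z"}),
                                    (recommended, {"a"})])
```

In the example, the relevant items sit at ranks 2 and 4, so the average precision is (1/2 + 2/4) / 2.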

We briefly recall the systems we evaluate: 

VMM RecSys. is the standard VMM recommender system in which browsing
behaviours are modelled as an ordered sequence of news items (Sec. 3.1.1).

CVMM RecSys. is a pure content-based approach where each news story is labelled
with the most probable topic (Sec. 3.1.2).

HVMM RecSys. is a hybrid VMM recommender system which brings together the
structure of the content-based system with the prediction of the standard VMM
method (Sec. 3.1.3).

k-CT RecSys. is a variation of the hybrid VMM recommender system, but with a
different context-tree structure using the entire topic distribution (Sec. 3.2).



4.3. Results 

For all systems, we use a prior of $\alpha_0 = 1/|\mathcal{N}|$ for the Dirichlet models and set the initial
weights of the experts to $w_k = 2^{-k}$, where $k$ is the depth of the node the expert is
assigned to. For the topic-based solutions, we evaluated experimentally the optimal
value in the range from 30 to 500 topics. Increasing the number of topics did not
significantly improve the performance, so we decided to set the number of topics to 50 because
(a) VMM and k-order Markov chain (b) context-tree recommender systems

Fig. 4. Accuracy for personalized news items (std + pop + fresh, $|\mathcal{F}| = 10$).

it is a reasonable compromise between accuracy and complexity. We varied the number of
popular items $|\mathcal{P}|$ from 10 to 500. When $|\mathcal{P}|$ is small, the experts consider only the
most recently read news stories as candidates. We also varied the number of fresh items $|\mathcal{F}|$
from 10 to 100, the mixture of experts (standard, popularity and/or freshness) and whether the
probabilities are computed via Bayesian update or not. We report averages over all
recommendations with 95% confidence intervals. Due to space constraints, we omit
figures for the TDG dataset, but we observed the same behaviours.

a) What role do the different experts play? What is the best way to update the 
weights of their mixture? 

The mixture of expert models plays an important role in the performance. The Bayesian
update of the mixture probabilities $p_i$ is more robust than the Dirichlet prior.

For instance, in Figure 3, mixtures integrating the popularity model are very sensitive
to the number of popular items while others are more robust. We see that there is 
an optimal number of popular items for which a recommender system gives the best 
accuracy, but also that the strategy to always recommend the most popular items does 
not pay off when the number of popular items increases. "Good" recommendations are 
drowned in popular items. Although naive, this approach of recommending the most 
popular stories is actually used very often on newspaper websites. 

We noticed that, when using the Dirichlet priors to update the mixture probabilities,
the prediction was mostly made by the popularity model, resulting in the same
behaviour as the most-popular recommender system as the number of popular items $|\mathcal{P}|$
increases. However, as the Bayesian update (Eq. 13) adapts the probabilities based on
the performance of each expert model, it is more robust when we increase $|\mathcal{P}|$. We also
observed that as the number of fresh items increases, CT models get slightly
better for both metrics.

b) Is the temporal sequence important when recommending news articles? 

Yes, the temporal sequence increases the accuracy. 

Figure 4(a) shows that the VMM recommender system performs better than fixed
k-order Markov chain recommenders such as the ones by Zimdars et al.
[2001]. In Figure 4 we consider only personalized items. For each approach, we
removed the popular items from the recommendation set $\mathcal{R}$ to get a reduced set
$\mathcal{R}_p = \mathcal{R} \setminus \mathcal{R}_t$. The set $\mathcal{R}_t$ contains the most popular news stories recommended by
Fig. 5. Weights distribution of the VMM over the depth of the context tree (Bayesian update, $|\mathcal{P}| = 60$,
$|\mathcal{F}| = 10$).

Fig. 6. Accuracy and novelty for context-tree recommender systems (std + pop + fresh, $|\mathcal{F}| = 10$).

the most-popular approach. Therefore, the set $\mathcal{R}_p$ has only personalized
recommendations.

In addition, the weights of the experts for the VMM recommender system are well
distributed over the space even for long sequences (Fig. 5). If the sequence were not
important, the weights of the experts at depths greater than 1 would have been driven to
0.

c) Does the content of the news stories help to make good recommendations?

No, the content of news stories does not help. 

For instance, in Figure 4(b) we focus only on novel recommendations since the most
popular items are taken out. Pure content-based approaches such as the CVMM and
k-CT systems perform particularly poorly. CVMM also has the worst performance in the
comparison of CT models presented in the next section. We do not observe that the
content allows both novel and accurate recommendations, which contradicts common
wisdom in the research community.
Fig. 7. Expected performance curves: accuracy and novelty trade-off for context-tree recommenders.

d) Do CT recommender systems make novel recommendations? 

Yes, but there is a trade-off between accuracy and novelty. 

We define novelty as the ratio of unseen and recommended items over the
recommended items: $|\mathcal{R} \cap \mathcal{F}| / |\mathcal{R}|$. In general, CT recommenders generate novel
recommendations (see Fig. 6). However, CVMM does not provide any novel items. The k-CT
recommender generates a lot of novel items, and seems to offer the best trade-off between
accuracy and novelty. However, we are not sure about the accuracy of novel items (see
Fig. 4(b)) since, as discussed in Section 4.2, we are only looking at traces and people
may not have found the novel items. Hence the k-CT recommender could still be a very
good option in practice as it generates a lot of novel results.
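
The novelty ratio is straightforward to compute once the set of unseen items is known. The sketch below assumes that this set is available as a plain Python set at recommendation time; the function name and the example data are ours.

```python
def novelty(recommended, unseen_items):
    """Share of the recommended items that belong to the set of unseen items."""
    if not recommended:
        return 0.0
    novel = sum(1 for item in recommended if item in unseen_items)
    return novel / len(recommended)


# Made-up example: two of the five recommended stories are unseen.
score = novelty(["a", "b", "c", "d", "e"], {"b", "e", "x"})
```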

4.4. Comparison 

The best way to implement a recommender system depends upon the system designer's 
goals, or the user's preferences. For news recommendation in particular, we are facing 
a trade-off between novelty and accuracy. We formalise the preferences of the designer 
with respect to this trade-off, via the following utility function: 



$$U(\omega \mid D, A) = \omega \cdot S@5 + (1 - \omega) \cdot novelty, \qquad (17)$$

where $\omega$ specifies how the trade-off between accuracy and novelty is made, $D$ is the
dataset (24H or TDG) and $A$ is an assignment of parameters. For CT systems, the
parameters are the number of popular items $|\mathcal{P}|$, the number of fresh items $|\mathcal{F}|$, whether the
probabilities are computed via a Bayesian update or not, and the mixture of experts (standard
plus popularity and/or freshness).

We can now simulate the process of a designer who is going to tune the recommender
system on a small dataset (24H), before deploying the recommender online
(on the TDG dataset). For any given value of $\omega$, we find the best parameters for the
24H dataset, and then measure the performance on the other dataset. This gives the
expected performance curve [Bengio et al. 2005], which provides us with an unbiased
evaluation of all systems' performance.

Figure 7 illustrates the expected performance curves for both the 24H and TDG
datasets. The first curve in Figure 7(a) shows the optimal utility $U(\omega \mid D, A^*(\omega, D))$ with
$A^*(\omega, D) = \arg\max_A U(\omega \mid D, A)$ for $D$ = 24H dataset. Figure 7(b) shows the corresponding
utility $U(\omega \mid D', A^*(\omega, D))$ achieved on the test dataset $D'$ = TDG using the parameters
found for the tuning dataset. It is clear that all methods are robust enough that
their curves are the same for both datasets even though the parameters are tuned only
on the smaller one. No matter how the trade-off $\omega$ is selected, the pure content-based
method is worse than all of the other approaches. This shows that the topic alone does
not identify which article an individual will read.
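
The tuning procedure behind the expected performance curve can be sketched as follows: for each trade-off value, pick the parameter assignment with the highest utility on the tuning dataset, then report the utility of that same assignment on the test dataset. The data structures, names and numbers below are hypothetical and serve only to illustrate the selection step.

```python
def utility(omega, success_at_5, novelty):
    """Utility of Eq. (17): a convex combination of accuracy and novelty."""
    return omega * success_at_5 + (1.0 - omega) * novelty


def expected_performance_curve(omegas, tuning_results, test_results):
    """Return one (omega, chosen assignment, test utility) triple per omega.

    tuning_results and test_results map a parameter assignment to its
    (S@5, novelty) pair on the tuning and the test dataset, respectively.
    """
    curve = []
    for omega in omegas:
        # Select the assignment on the tuning dataset only.
        best = max(tuning_results,
                   key=lambda a: utility(omega, *tuning_results[a]))
        curve.append((omega, best, utility(omega, *test_results[best])))
    return curve


# Made-up results for two hypothetical parameter assignments.
tuning = {"A": (0.30, 0.10), "B": (0.10, 0.60)}
test = {"A": (0.25, 0.20), "B": (0.15, 0.50)}
curve = expected_performance_curve([0.0, 0.5, 1.0], tuning, test)
```

Since the test dataset never influences the choice of assignment, the reported utility is not biased by the tuning step.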

5. CONCLUSION 

Because of the abundance of news on the web, news recommendation is an important 
problem, but also challenging due to the natural properties of a news item: when a 
news story is very recent, there is little data available to generate recommendations. 
Moreover, it is subject to trends and preference changes over time. 

In this paper, we introduced a class of recommender systems based on context trees
that bring relevant and interesting news articles to readers. Classic recommender system
approaches such as collaborative filtering require the model to be recomputed every
time. In this work, we proposed an incremental algorithm that updates the model
continuously, and is thus better suited to such a dynamic domain.

We considered different context trees in the space of sequences of news, sequences of 
topics, and in the space of topic distributions. More specifically, we presented the VMM 
recommender in which browsing behaviours are modelled as an ordered sequence of 
news items; the CVMM recommender where each news story is labelled with the most 
probable topic; the HVMM which brings together the structure of the CVMM system with
the prediction of the standard VMM method; and finally the k-CT recommender using
the entire topic distribution. 

In the context of news recommendations, we defined different expert models which 
consider the popularity and freshness of news items, and examined ways to combine 
them into a single model. 

In conclusion, we showed that CT recommender systems are flexible enough to cap- 
ture the properties of news items, and perform better than existing techniques. 

We demonstrated that a) a sophisticated expert model can improve recommendation
quality, b) the temporal sequence is important for recommendation, c) the content
does not help in making good recommendations because the topic alone is not enough
for personalized recommendations (we think this is because users do not like to read
multiple stories about the same topic), and d) CT models achieve a good trade-off between
novelty and accuracy. Individual behaviour plays an important role. Finding a good
model that characterizes this individual behaviour is an open research question. We
believe that news websites should consider these techniques for keeping readers
interested in their sites.

For future work, we would like to examine different expert models. For example, we 
could additionally consider the time a reader spends on a given news article or topic. 
Finally, in order to accurately evaluate the performance of systems that recommend 
many novel items, we intend to conduct an online user study. 